Full-text and Keyword Indexes for String Searching

Author

  • Aleksander Cislak
Abstract

String searching consists of locating a substring in a longer text, where two strings may also match approximately (various similarity measures exist, such as the Hamming distance). Strings can be defined very broadly: they usually contain natural language or biological data (DNA, proteins), but they can also represent other kinds of data such as music or images. One solution to string searching is to use online algorithms, which do not preprocess the input text; however, this is often infeasible due to the massive sizes of modern data sets. Alternatively, one can build an index, i.e. a data structure which aims to speed up string matching queries. Indexes are divided into full-text ones, which operate on the whole input text and can answer arbitrary queries, and keyword indexes, which store a dictionary of individual words. In this work, we present a literature review for both index categories as well as our contributions (which are mostly practice-oriented). The first contribution is the FM-bloated index, a modification of the well-known FM-index (a compressed full-text index) that trades space for speed. In our approach, the count table and the occurrence lists store information about selected q-grams in addition to the individual characters. Two variants are described, namely one using O(n log n) bits of space with O(m + log m log log n) average query time, and one with linear space and O(m log log n) average query time, where n is the input text length and m is the pattern length. We experimentally show that a significant speedup can be achieved by operating on q-grams (albeit at the cost of very high space requirements, hence the name “bloated”). In the category of keyword indexes, we present the so-called split index, which can efficiently solve the k-mismatches problem, especially for 1 error. Our implementation in the C++ language focuses mostly on data compaction, which benefits the search speed (by being cache friendly).
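The one-mismatch case of the k-mismatches problem can be illustrated with a minimal Python sketch based on the pigeonhole principle: if a word is cut into two pieces and at most one character differs, at least one piece must match exactly. The names `build_split_index` and `search_one_mismatch`, the key layout, and the use of Python sets are assumptions made for illustration only; this is not the author's compact C++ implementation.

```python
from collections import defaultdict


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def build_split_index(words):
    """Map each half of a word to the other halves stored next to it.

    Keys are (word length, side, piece); side 0 means the piece is the
    left half, side 1 the right half. This layout is hypothetical.
    """
    index = defaultdict(set)
    for w in words:
        mid = len(w) // 2
        left, right = w[:mid], w[mid:]
        index[(len(w), 0, left)].add(right)
        index[(len(w), 1, right)].add(left)
    return index


def search_one_mismatch(index, pattern):
    """Return all indexed words within Hamming distance 1 of pattern."""
    mid = len(pattern) // 2
    left, right = pattern[:mid], pattern[mid:]
    hits = set()
    # Left half exact: the single mismatch, if any, is in the right half.
    for cand in index.get((len(pattern), 0, left), ()):
        if hamming(cand, right) <= 1:
            hits.add(left + cand)
    # Right half exact: the single mismatch, if any, is in the left half.
    for cand in index.get((len(pattern), 1, right), ()):
        if hamming(cand, left) <= 1:
            hits.add(cand + right)
    return hits
```

Only the candidates that share an exact half with the pattern are verified, so the number of explicit comparisons stays far below the dictionary size for typical natural-language word lists.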
We compare our solution with other algorithms and show that it is faster when the Hamming distance is used. Query times on the order of 1 microsecond were reported for one mismatch for a few-megabyte natural language dictionary on a medium-end PC. A minor contribution is string sketches, which aim to speed up approximate string comparison at the cost of additional space (O(1) per string). They can be used in the context of keyword indexes in order to deduce that two strings differ by at least k mismatches with the use of fast bitwise operations rather than an explicit verification.
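The idea of rejecting a candidate with bitwise operations can be sketched in Python as follows. The construction shown here, an occurrence bit per letter from a small fixed alphabet, is one possible sketch design and is an assumption; the letter set `SKETCH_LETTERS` and the function names are hypothetical and are not taken from the work itself.

```python
SKETCH_LETTERS = "etaoinsh"  # hypothetical choice: 8 frequent English letters


def sketch(s: str) -> int:
    """Occurrence sketch: bit i is set iff SKETCH_LETTERS[i] occurs in s."""
    bits = 0
    for i, c in enumerate(SKETCH_LETTERS):
        if c in s:
            bits |= 1 << i
    return bits


def mismatch_lower_bound(sk_a: int, sk_b: int) -> int:
    """Lower bound on the Hamming distance implied by two sketches.

    One substituted character can flip at most two sketch bits (the old
    letter may vanish, the new one may appear), so d differing bits imply
    at least ceil(d / 2) mismatches between equal-length strings.
    """
    differing = bin(sk_a ^ sk_b).count("1")
    return (differing + 1) // 2


def may_match(sk_a: int, sk_b: int, k: int) -> bool:
    """False means the strings certainly differ by more than k mismatches."""
    return mismatch_lower_bound(sk_a, sk_b) <= k
```

A sketch of this kind fits in one machine word, which corresponds to the O(1) extra space per string; when `may_match` returns `True`, an explicit character-by-character verification is still required.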

Similar resources

Approximate String Matching with Compressed Indexes

A compressed full-text self-index for a text T is a data structure requiring reduced space and able to search for patterns P in T. It can also reproduce any substring of T, thus actually replacing T. Despite the recent explosion of interest in compressed indexes, there has not been much progress on functionalities beyond the basic exact search. In this paper we focus on indexed approximate s...

7. Full-Text Indexes in External Memory

A full-text index is a data structure storing a text (a string or a set of strings) and supporting string matching queries: given a pattern string P, find all occurrences of P in the text. The best-known full-text index is the suffix tree [761], but numerous others have been developed. Due to their fast construction and the wealth of combinatorial information they reveal, full-text indexes (an...

Approximate String Matching with Lempel-Ziv Compressed Indexes

A compressed full-text self-index for a text T is a data structure requiring reduced space and capable of searching for patterns P in T. Furthermore, the structure can reproduce any substring of T, thus it actually replaces T. Despite the explosion of interest in self-indexes in recent years, there has not been much progress on search functionalities beyond the basic exact search. In this paper...

DHT Based Searching Improved by Sliding Window

Efficient full-text searching is a big challenge in Peer-to-Peer (P2P) systems. Recently, the Distributed Hash Table (DHT) has become one of the reliable communication schemes for P2P. Some research efforts perform keyword searching and result intersection on a DHT substrate. Two or more search requests must be issued for a multi-keyword query. This article proposes a Sliding Window improved Multi-keyword ...

Computing Matching Statistics and Maximal Exact Matches on Compressed Full-Text Indexes

Exact string matching is a problem that computer programmers face on a regular basis, and full-text indexes like the suffix tree or the suffix array provide fast string search over large texts. In the last decade, research on compressed indexes has flourished because the main problem in large-scale applications is the space consumption of the index. Nowadays, the most successful compressed inde...

Journal title:
  • CoRR

Volume abs/1508.06610  Issue 

Pages  -

Publication date 2015